Sequence models based on linear state spaces (SSMs) have recently emerged as a promising choice of architecture for modeling long range dependencies across various modalities. However, they invariably rely on discretization of a continuous state space, which complicates their presentation and understanding. In this work, we dispose of the discretization step, and propose a model based on vanilla Diagonal Linear RNNs ($\mathrm{DLR}$). We empirically show that $\mathrm{DLR}$ is as performant as previously-proposed SSMs in the presence of strong supervision, despite being conceptually much simpler. Moreover, we characterize the expressivity of SSMs (including $\mathrm{DLR}$) and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks involving interactions over tens of thousands of tokens, ranging from simple operations, such as shifting an input sequence, to detecting co-dependent visual features over long spatial ranges in flattened images. We find that while SSMs report near-perfect performance on tasks that can be modeled via $\textit{few}$ convolutional kernels, they struggle on tasks requiring $\textit{many}$ such kernels and especially when the desired sequence manipulation is $\textit{context-dependent}$. For example, $\mathrm{DLR}$ learns to perfectly shift a $0.5M$-long input by an arbitrary number of positions but fails when the shift size depends on context. Despite these limitations, $\mathrm{DLR}$ reaches high performance on two higher-order reasoning tasks $\mathrm{ListOpsSubTrees}$ and $\mathrm{PathfinderSegmentation}\text{-}\mathrm{256}$ with input lengths $8K$ and $65K$ respectively, and gives encouraging performance on $\mathrm{PathfinderSegmentation}\text{-}\mathrm{512}$ with input length $262K$ for which attention is not a viable choice.
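To make the idea of a diagonal linear RNN and its link to convolutional kernels concrete, here is a minimal sketch. It assumes a generic per-state recurrence $x_k = \lambda \cdot x_{k-1} + u_k$ with a linear readout; the function names and the exact parameterization are illustrative assumptions, not the implementation used for $\mathrm{DLR}$.

```python
def dlr_recurrent(lambdas, weights, u):
    """Run a diagonal linear recurrence, one scalar state per eigenvalue.

    Hypothetical parameterization: x_k = lambda * x_{k-1} + u_k for each
    state, y_k = Re(sum_i w_i * x_{k,i}).
    """
    state = [0j] * len(lambdas)
    outputs = []
    for u_k in u:
        # Diagonal transition: each state is updated independently.
        state = [lam * s + u_k for lam, s in zip(lambdas, state)]
        outputs.append(sum(w * s for w, s in zip(weights, state)).real)
    return outputs


def dlr_kernel(lambdas, weights, length):
    """Kernel entries K_j = sum_i w_i * lambda_i ** j."""
    return [sum(w * lam ** j for w, lam in zip(weights, lambdas))
            for j in range(length)]


def dlr_convolutional(lambdas, weights, u):
    """Same outputs as dlr_recurrent, computed as a causal convolution."""
    kernel = dlr_kernel(lambdas, weights, len(u))
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)).real
            for k in range(len(u))]


# Example values chosen only for illustration.
lambdas = [0.9, 0.5 + 0.5j, -0.3j]
weights = [1.0, 0.5 - 0.2j, 2.0]
u = [1.0, 0.0, -2.0, 3.0, 0.5]
recurrent_out = dlr_recurrent(lambdas, weights, u)
convolutional_out = dlr_convolutional(lambdas, weights, u)
```

Both functions produce the same sequence, which is why such a layer can be described as applying a single convolutional kernel per channel and why a fixed kernel cannot, on its own, express a context-dependent manipulation.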
translated by Google Translate
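Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods rely on running models over inputs, recent work has shown that a zero-pass approach, in which parameters are interpreted directly without a forward/backward pass, is feasible for some Transformer parameters and for two-layer attention networks. In this work, we present a theoretical analysis in which all parameters of a trained Transformer are interpreted by projecting them into the embedding space, that is, the space of vocabulary items they operate on. We derive a simple theoretical framework to support our arguments and provide ample evidence for its validity. First, an empirical analysis shows that the parameters of both pretrained and fine-tuned models can be interpreted in embedding space. Second, we present two applications of our framework: (a) aligning the parameters of different models that share a vocabulary, and (b) building a classifier by "translating" the parameters of a fine-tuned classifier into the parameters of a different model that was only pretrained. Overall, our findings open the door to interpretation methods that, at least in part, abstract away from model specifics and operate in the embedding space only.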
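State space models have been shown to be effective at modeling long-range dependencies, especially on sequence classification tasks. In this work, we focus on autoregressive sequence modeling over English books, GitHub source code, and arXiv mathematics articles. Building on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e., DSS) on TPUs, is fairly competitive with Transformer-based baselines, and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.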
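State space models (SSMs) have recently been shown to be very effective as a deep learning layer and are a promising alternative to sequence models such as RNNs, CNNs, or Transformers. The first version to show this potential was the S4 model, which is particularly effective on tasks involving long-range dependencies thanks to a prescribed state matrix called the HiPPO matrix. While this provides an interpretable mathematical mechanism for modeling long-range dependencies, it introduces a custom representation and algorithm that can be difficult to implement. On the other hand, a recent variant of S4 called DSS showed that restricting the state matrix to be fully diagonal can still preserve the performance of the original model when using a specific initialization based on approximating S4's matrix. This work seeks to systematically understand how to parameterize and initialize such diagonal state space models. While it follows from classical results that almost all SSMs have an equivalent diagonal form, we show that the initialization is critical for performance. We explain why DSS works mathematically by showing that the diagonal restriction of S4's matrix surprisingly recovers the same kernel in the limit of infinite state dimension. We also systematically describe various design choices in parameterizing and computing diagonal SSMs, and perform a controlled empirical study of the effects of these choices. Our final model, S4D, is a simple diagonal version of S4 whose kernel computation requires just 2 lines of code and which performs comparably to S4 in almost all settings, with state-of-the-art results in the image, audio, and medical time-series domains, and an average of 85% on the Long Range Arena benchmark.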
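As a rough illustration of why the kernel of a diagonal SSM is short to compute, the sketch below builds the kernel $K_\ell = \sum_n C_n \bar{B}_n \bar{A}_n^{\ell}$ for a diagonal state matrix. It assumes a zero-order-hold discretization ($\bar{A} = e^{\Delta A}$, $\bar{B} = (\bar{A} - 1) A^{-1} B$) and simply takes the real part of the result; the function name, argument names, and these choices are assumptions for illustration and may differ from the parameterization the authors settle on.

```python
import cmath


def diagonal_ssm_kernel(A, B, C, step, length):
    """Convolution kernel of a diagonal SSM (hypothetical helper).

    A, B, C are lists with one (possibly complex) entry per state,
    step is the discretization step size, length is the kernel length.
    Assumes zero-order-hold discretization and nonzero entries in A.
    """
    # Discretize each diagonal entry independently.
    A_bar = [cmath.exp(step * a) for a in A]
    B_bar = [(a_bar - 1) / a * b for a_bar, a, b in zip(A_bar, A, B)]
    # Vandermonde-style sum over states for every kernel position.
    return [sum(c * b_bar * a_bar ** l
                for c, b_bar, a_bar in zip(C, B_bar, A_bar)).real
            for l in range(length)]


# Example values chosen only for illustration.
kernel = diagonal_ssm_kernel(
    A=[-0.5 + 1.0j, -0.5 - 1.0j],
    B=[1.0, 1.0],
    C=[0.5, 0.5],
    step=0.1,
    length=8,
)
```

Because the state matrix is diagonal, each state contributes a geometric sequence, so the whole kernel reduces to a weighted sum of powers rather than repeated matrix products.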
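NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts make up a considerable share of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets and select those in which the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. Initial baselines, including Longformer Encoder-Decoder, indicate that there is ample room for improvement on SCROLLS. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architectures and pretraining methods.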
Existing federated classification algorithms typically assume the local annotations at every client cover the same set of classes. In this paper, we aim to lift such an assumption and focus on a more general yet practical non-IID setting where every client can work on non-identical and even disjoint sets of classes (i.e., client-exclusive classes), and the clients have a common goal which is to build a global classification model to identify the union of these classes. Such heterogeneity in client class sets poses a new challenge: how to ensure different clients are operating in the same latent space so as to avoid the drift after aggregation? We observe that the classes can be described in natural languages (i.e., class names) and these names are typically safe to share with all parties. Thus, we formulate the classification problem as a matching process between data representations and class representations and break the classification model into a data encoder and a label encoder. We leverage the natural-language class names as the common ground to anchor the class representations in the label encoder. In each iteration, the label encoder updates the class representations and regulates the data representations through matching. We further use the updated class representations at each round to annotate data samples for locally-unaware classes according to similarity and distill knowledge to local models. Extensive experiments on four real-world datasets show that the proposed method can outperform various classical and state-of-the-art federated learning methods designed for learning with non-IID data.
The rise in data has led to the need for dimension reduction techniques, especially in the area of non-scalar variables, including time series, natural language processing, and computer vision. In this paper, we specifically investigate dimension reduction for time series through functional data analysis. Current methods for dimension reduction in functional data are functional principal component analysis and functional autoencoders, which are limited to linear mappings or scalar representations of the time series and are therefore inefficient. In real data applications, the nature of the data is much more complex. We propose a non-linear function-on-function approach, consisting of a functional encoder and a functional decoder, that uses continuous hidden layers made up of continuous neurons to learn the structure inherent in functional data, which addresses the aforementioned concerns with existing approaches. Our approach gives a low-dimensional latent representation by reducing the number of functional features as well as the timepoints at which the functions are observed. The effectiveness of the proposed model is demonstrated through multiple simulations and real data examples.
Landing an unmanned aerial vehicle (UAV) on top of an unmanned surface vehicle (USV) in harsh open waters is a challenging problem, owing to forces that can damage the UAV due to a severe roll and/or pitch angle of the USV during touchdown. To tackle this, we propose a novel model predictive control (MPC) approach enabling a UAV to land autonomously on a USV in these harsh conditions. The MPC employs a novel objective function and an online decomposition of the oscillatory motion of the vessel to predict, attempt, and accomplish the landing during near-zero tilt of the landing platform. The nonlinear prediction of the motion of the vessel is performed using visual data from an onboard camera. Therefore, the system does not require any communication with the USV or a control station. The proposed method was analyzed in numerous robotics simulations in harsh and extreme conditions and further validated in various real-world scenarios.
Multiple studies have focused on predicting the prospective popularity of an online document as a whole, without paying attention to the contributions of its individual parts. We introduce the task of proactively forecasting popularities of sentences within online news documents solely utilizing their natural language content. We model sentence-specific popularity forecasting as a sequence regression task. For training our models, we curate InfoPop, the first dataset containing popularity labels for over 1.7 million sentences from over 50,000 online news documents. To the best of our knowledge, this is the first dataset automatically created using streams of incoming search engine queries to generate sentence-level popularity annotations. We propose a novel transfer learning approach involving sentence salience prediction as an auxiliary task. Our proposed technique coupled with a BERT-based neural model exceeds nDCG values of 0.8 for proactive sentence-specific popularity forecasting. Notably, our study presents a non-trivial takeaway: though popularity and salience are different concepts, transfer learning from salience prediction enhances popularity forecasting. We release InfoPop and make our code publicly available: https://github.com/sayarghoshroy/InfoPopularity
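For readers unfamiliar with the nDCG metric mentioned above, the following is a minimal sketch of one common definition, using linear gains and a $\log_2$ position discount. The abstract does not state which nDCG variant or cutoff was used, so the function names and the choice of variant here are assumptions for illustration only.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg(predicted_scores, true_relevances):
    """nDCG of the ranking induced by predicted_scores (hypothetical helper).

    Items are ranked by predicted score in descending order, and the
    resulting DCG is divided by the DCG of the ideal ordering.
    """
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_relevances[i] for i in order]
    ideal_dcg = dcg(sorted(true_relevances, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Example: predicted popularity scores for six sentences and their
# (made-up) ground-truth popularity labels.
score = ndcg([6, 5, 4, 3, 2, 1], [3, 2, 3, 0, 1, 2])
```

A value of 1.0 means the predicted ordering of sentences matches the ideal ordering by true popularity; lower values indicate that popular sentences were ranked too low.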
The ability for an agent to continuously learn new skills without catastrophically forgetting existing knowledge is of critical importance for the development of generally intelligent agents. Most methods devised to address this problem depend heavily on well-defined task boundaries, and thus depend on human supervision. Our task-agnostic method, Self-Activating Neural Ensembles (SANE), uses a modular architecture designed to avoid catastrophic forgetting without making any such assumptions. At the beginning of each trajectory, a module in the SANE ensemble is activated to determine the agent's next policy. During training, new modules are created as needed and only activated modules are updated to ensure that unused modules remain unchanged. This system enables our method to retain and leverage old skills, while growing and learning new ones. We demonstrate our approach on visually rich procedurally generated environments.